Using the hybrid EMD-BPNN model to predict the incidence of HIV in Dalian, Liaoning Province, China, 2004–2018

Background Acquired immunodeficiency syndrome (AIDS) is a malignant infectious disease with high mortality caused by HIV (human immunodeficiency virus, and up to now there are no curable drugs or effective vaccines. In order to understand AIDS’s development trend, we establish hybrid EMD-BPNN (empirical modal decomposition and Back-propagation artificial neural network model) model to forecast new HIV infection in Dalian and to evaluate model’s performance. Methods The monthly HIV data series are decomposed by EMD method, and then all decomposition results are used as training and testing data to establish BPNN model, namely BPNN was fitted to each IMF (intrinsic mode function) and residue separately, and the predicted value is the sum of the predicted values from the models. Meanwhile, using yearly HIV data to established ARIMA and using monthly HIV data to established BPNN, and SARIMA (seasonal autoregressive integrated moving average) model to compare the predictive ability with EMD-BPNN model. Results From 2004 to 2017, 3310 cases of HIV were reported in Dalian, including 101 fatal cases. The monthly HIV data series are decomposed into four relatively stable IMFs and one residue item by EMD, and the residue item showed that the incidence of HIV increases firstly after declining. The mean absolute percentage error value for the EMD-BPNN, BPNN, SARIMA (1,1,2) (0,1,1)12 in 2018 is 7.80%, 10.79%, 9.48% respectively, and the mean absolute percentage error value for the ARIMA (3,1,0) model in 2017 and 2018 is 8.91%. Conclusions The EMD-BPNN model was effective and reliable in predicting the incidence of HIV for annual incidence, and the results could furnish a scientific reference for policy makers and health agencies in Dalian. Supplementary Information The online version contains supplementary material available at 10.1186/s12879-022-07061-7.


Background
The full name of AIDS is acquired immunodeficiency syndrome, which is a malignant infectious disease with high mortality caused by human immunodeficiency virus infection [1], and has become one of global public health problem [2]. Since the first case reported in China [3], there has been a tendency toward an increase in HIV epidemics in the country [4]. As of September 30 in 2013, a total of 434,000 HIV infections and AIDS patients were reported in China, and the main way of transmission in China was transsexual transmission at present [5]. Up to now, there are no curable drugs or effective vaccines. Therefore, establish forecasting model of HIV, so as to find the development trend of HIV, is of great significance to the HIV prevention and control work. For the epidemic trend of HIV, its influencing factors are complex, including population, economy, behavior and environment. At present, China has not fully carried out the Open Access *Correspondence: anqingyu@163.com 1 Dalian Center for Disease Control and Prevention, Dalian 116021, Liaoning, China Full list of author information is available at the end of the article monitoring and collection of data on HIV-related influencing factors, so it is difficult to establish a prediction model of HIV by analyzing the influencing factors. In recent years, some scholars have used ARIMA model, GM (1,1) model, BP neural network model to predict the incidence trend of HIV. For example, Liang et al. [6] using G (1, 1) modeling method to fit the incidence of HIV in Jiangsu province, the relative error was 23.89%. GM is a mathematical model based on grey system theory, which can systematically predict the trend of variable change. GM (1,1) is the simplest form of model, the basic steps of establishing the model are firstly cumulate the irregular original data to a regular sequence of data and then build the differential equation, so as to predict the future development trend of the disease [7]. Yang et al. [8] using ARIMA to build modeling HIV incidence from 2000 to 2014 in China and the mean absolute percentage error was 19.90%. Wu et al. [9] use Back-propagation neural network (BP-ANN) as a model to predict HIV prevalence, and the ratios of accuracy for training, calibration and detection were 93.94%, 88.48% and 89.60%, respectively.
In order to improve the accuracy of the prediction model, in this study, we took the HIV incidence data from 2004 to 2018 as an example, established a two-stage EMD (empirical mode decomposition)-BPNN (back-propagation artificial neural network model) model to forecast HIV in Dalian and the prediction results will provide the basis for AIDS monitoring and prevention in Dalian.

Study area
Dalian is the main coastal city of Liaoning Province, China and a major tourist city located at 38° 43′-40° 10′ N latitude and 120° 58′-123° 31′ E longitude. It had a registered population of 5.949 million in 2017. Dalian has a warm continental monsoon climate and is in a marine temperate zone. The average annual temperature is 10 ℃ with a maximum of 35 ℃, and a minimum of − 28 ~ − 18 ℃. The average rainfall is 550-800 mm and the total hours of annual sunshine is 2500-2800 h [10].

Data collection
The HIV was a notifiable monitoring communicable disease in China. The clinicians are required to report HIV cases through the China information system for disease control and prevention within 24 h. During 1995 to 2008, there was 320 cases of HIV infection reported in Dalian, and the average annual growth rate of incidence was 31.06% [11]. Yearly incidence rate of HIV during the period of 1999 to 2018 and monthly incidence number of HIV cases in Dalian during the period of 2004 to 2018 was obtained from above information system's statistical report around the beginning of February in the next year. The diagnostic criteria of HIV is compliance with diagnostic criteria of HIV infection by Secretariat of the Professional Committee on Infectious Diseases Standards and Secretariat of the Professional Committee on Parasitic Disease Standards of the Ministry of Health of the People's Republic of China in May, 2010 [12].

Data analysis
The monthly HIV data series are decomposed by EMD method, and then all decomposition results are used as training and testing data to establish BPNN model, namely BPNN was fitted to each intrinsic mode function (IMF) and residue separately, the predicted value is the sum of the predicted values from the models, and the annual incidence equal to the sum of the predicted results of the 12 months. Meanwhile, using monthly HIV data to established BPNN and SARIMA model and using yearly HIV incidence rate to established ARIMA model to compare the predictive ability with EMD-BPNN model.

Empirical modal decomposition
Empirical modal decomposition is a self-adaptive decomposition method which is developed for non-stationary and nonlinear signal processing, was proposed by Huang et al. [13]. After EMD processing, complex signals can be decomposed into several intrinsic mode function (IMF) components based on the local characteristic time scale of the signal from high to low frequency and a residue.
The expression of the model is: where s(t) denote original signals, imfi(t) denote the i th intrinsic mode function, and i = 1, 2, · · · , n,rn(t) denote residual signal. The IMF must meet the following two conditions: 1. In the whole signal data series, the number of extrema points must be equal to the number of Zero or differ by one at most; 2. The mean value of envelope defined by maximum and minimum must be equal and zero [14].
Empirical modal decomposition of monthly incidence data of HIV from 2004 to 2017 was performed by using MATLAB 2014.

BP artificial neural network
A back-propagation artificial neural network is a multilayer feed-forward neural network trained by error back propagation algorithm [15]. Its topological structure consists of three layers, namely input layer, hidden layer and output layer, each of which consists of some joints. The hidden layer connects input and output layer, and their correlations are reflected by relevant coefficients, namely connection weights. Usually, through self-learning or self-training of artificial neural network, repeatedly adjust these coefficients until reaching a goal of welltrained model (Fig. 1).
The learning process of BP algorithm consists of two processes: forward propagation and backward propagation. In the forward propagation process, the input information is transferred from the input layer to the output layer through the hidden layer. If the output layer does not get the desired output, it is transferred to the back propagation. The error signal is returned along the original connection path, and the connection weights between joint in each layer are modified, so the network parameters are adjusted repeatedly to make the minimized error function [16].
BPNN model was performed by using DPS data processing system.

The ARIMA and SARIMA model
A SARIMA model can be described as ARIMA (p, d, q) multiplied by (P, D, Q), where the terms p, d, q represent ordinary components, while P, D, Q represent seasonal components [17]. Simulated the monthly incidence data of HIV from 2004 to 2017 and the yearly incidence rate of HIV from 1999 to 2016 with the Box-Jenkins modeling approach, we fitted the ARIMA model and the SARIMA model to HIV incidence, and then used the fitted model to out-of-sample predict yearly incidence rate of HIV for the year 2017 and 2018 and monthly incidence of HIV for the year 2018, respectively.
Firstly, autocorrelation function (ACF) graph and partial autocorrelation function (PACF) graph were used to identify the possible values for the autoregressive or moving average components. Secondly, estimates of the model's parameters were obtained by the least squares method according to the different values of p, d, q and p, d, q, P, D, Q. Thirdly, compared the models by the Akaike information criterion (AIC), where the preferred model was the one with the lowest AIC value. Finally the goodness of fit of each model was verified by plotting the autocorrelation and partial autocorrelation of residuals and by using the Ljung-Box test. SARIMA and ARIMA model was performed by using 11.5 version of Statistical Package for Social Sciences (SPSS11.5) with a significant level of p < 0.05.

Performance comparison
We used ARIMA, BPNN and the seasonal ARIMA model as single time series prediction method to predict yearly and monthly incidence rate of HIV and compare the predictive ability with EMD-BPNN model.
The criterion for comparing the predictive ability of the models was the absolute percentage error defined as: where xt and xt denote observed and fitted value at time point respectively. Thus, the preferred model is the one with the lowest mean absolute percentage error. Fig. 1 The topological structure of BPNN model

Descriptive analysis
From 2004 to 2017, 3310 cases of HIV were reported (3122 males and 188 females) in Dalian, Liaoning Province, China, including 101 fatal cases. The average incidence rate is 4.01/100,000 population, the average mortality rate is 0.12/100,000 population, and the case fatality rate is 3.05%. The incidence of the HIV is mainly concentrated in the age group of 20 to 49 years old which has 2751 cases of HIV reported and coming to 83.11% in total.

Decomposition results of the original monthly HIV data series by EMD
The original monthly HIV data series from 2004 to 2017 was decomposed by EMD. Four independent IMFs and one residue item were obtained. Figure 2 showed decomposition results of the original monthly HIV data series of Dalian by EMD. As shown in figure, IMF presents the oscillation characteristics in order from high frequency to low frequency. Specifically, IMF1, IMF2 and IMF3 represented high frequency variable and large fluctuation range, IMF4 represented low frequency component, and the fluctuation range was small. The residue item is the last one, which is reflect the overall trend of the original data series. In this study, the residue item showed that the monthly incidence data of HIV in Dalian during2004 to 2015 is increasing and during 2016 to 2017 is declining.

Fitting and forecasting IMFs and residue by BPNN model
In this study, we establish a BPNN model to predict the decompositions IMF1 to IMF4 and the residue item. Because of the monthly data series, we took the every 12 IMF and residue item as the input, the thirteenth IMF and residue item as output. The parameters of model were as follows. We test 12 candidate the node number of one hidden layer from 1 to 12, and then, we calculated fitting residuals to determine the optimal value. Table 1 showed the node number of one hidden layer and the model's fitting residuals. The node value was 12 had the smallest fitting residuals (0.003581%), so we choice the node number of one hidden layer was 12, the minimum training velocity was 0.1, the permissible error was 0.001, the maximum iteration number was 1000, and the values of input nodes were standardization transformation. After 1000 run, IMF1 to IMF4 fitting residuals were 0.003581%, 0.001022%, 0.000354% and 0.000218%, residue item fitting residuals was 0.00132%, and the model fitted well. Table 2 showed the number of predicted cases from January to December 2018 obtained from the EMD-BPNN model, and the annual incidence equal to the sum of the predicted results of the 12 months. The observed and predicted values in 2018 were relatively close to each other, and the absolute percentage error value for the model was 7.80%.

Performance comparison analysis
To understand the performance of the hybrid EMD-BPNN model, the predicted results of the hybrid EMD-BPNN model were compared with BPNN, the seasonal ARIMA and ARIMA.

BPNN
Because of the monthly data series, we took the every 12 month incidences as the input, the thirteenth month incidence as output. The parameters of model were as follows. The node number of one hidden layer was 12, the minimum training velocity was 0.1, the permissible error was 0.001, the maximum iteration number was 1000, and the values of input nodes were standardization transformation. After 1000 run, fitting residuals were 0.002840%, and the model fitted well. Table 3 showed the number of predicted cases from January to December 2018 obtained from the BPNN model, and the annual incidence equal to the sum of the  predicted results of the 12 months. The absolute percentage error for the model was 10.79%.

SARIMA
Firstly, the sequence of HIV incidence from 2004 to 2017 was used to determine the stability of the sequence. According to Fig. 3, the series show non-stationary mean, so it is necessary to stabilize the variance of HIV incidence by computing its natural logarithm. The monthly data used in this study, so the time series was differenced once at the seasonal level and non-seasonal level. The transformed HIV incidence showed far less dispersion and become stationary because of that the number of HIV cases fluctuated around the mean. All further statistical procedures were performed on the transformed HIV incidence. Secondly, because of the time series after one seasonal difference and non-seasonal difference, d equals one and D equals one. The slow decay in the ACF at lags 1-2 (autocorrelation1 = − 0.578, autocorrela-tion2 = 0.222) suggests that non-seasonal q equals to one or two, while also it may be zero, and a PACF cutoff at lag 1 (partial autocorrelation1 = − 0.578) suggests that non-seasonal p equals one. The slow decay in the ACF at lags 11-13 (autocorrelation11 = 0.315, auto-correlation12 = − 0.408, autocorrelation13 = 0.206), and a PACF cutoff at lag 11 (partial autocorrela-tion11 = 0.298) suggests seasonal Q equals zero, one or two and seasonal P equals zero.
The model parameters for SARIMA (1,1,2) (0,1,1) 12 model, the non-seasonal autoregressive parameters is estimated as − 0.839 (SE = 0.142, t = − 5.925, p = 0.000), non-seasonal moving average parameters one is estimated as 0.183 (SE = 0.176, t = 1.036, p = 0.302), non-seasonal moving average parameters two is estimated as 0.614 (SE = 0.168, t = 3.649, p = 0.000) and seasonal moving average parameter is 0.952 (SE = 0.267, t = 3.567, p = 0.000) respectively. Table 4 showed AIC values for the SARIMA models corresponding to different choices of p, q, P and  The Ljung-Box statistic of ARIMA (1, 1, 2) (0, 1, 1) 12 model showed that there were high p-values associated with the statistics. The null hypothesis of independence in this residual time series cannot be rejected (Table 5). The plots of the ACF and PACF of the residuals show no remaining temporal correlation (Figs. 4, 5). Thus it can be concluded that the ARIMA (1, 1, 2) (0, 1, 1) 12 model identified fit the data well. Table 6 showed out-of-sample predicted values obtained from the SARIMA (1,1,2) (0,1,1) 12 model and the results were compared with the observed number of HIV cases in 2018. The absolute percentage error value for the model was 9.48%.

ARIMA
Firstly, the sequence of HIV incidence rate from 1999 to 2016 was used to determine the stability of the sequence. According to Fig. 6, the series show non-stationary mean, so the time series was differenced once at the non-seasonal level. The transformed HIV incidence become stationary because of that the incidence rate of HIV fluctuated around the mean (Fig. 7). All further statistical procedures were performed on the transformed HIV incidence.
Secondly, because of the time series after one non-seasonal difference, d equals one. In the ACF, the coefficients are within the confidence interval (Fig. 7), suggests that non-seasonal q equals zero, and a PACF cutoff at lag 3 (partial autocorrelation1 = − 0.520) suggests that nonseasonal p equals three (Fig. 8).
The Ljung-Box statistic of ARIMA (3, 1, 0) model showed that there were high p-values associated with the statistics. The null hypothesis of independence in this residual time series can't be rejected ( Table 7). The plots of the ACF and PACF of the residuals show no remaining temporal correlation. Thus it can be concluded that the ARIMA (3,1,0) model identified fit the data well.
The predict incidence rates of 2017 and 2018 obtained from the ARIMA (3,1,0) model were 5.569 per 100,000 population and 4.433 per 100,000 population    respectively. Compared to the actual value of 5.266 per 100,000 population and 5.041 per 100,000 population, the average absolute percentage error for the model was 8.91%.

Discussion
As one of the important public health problems in the world, AIDS has caused more and more serious harm to human health. It is one of the three global threats facing human beings, especially in developing countries [18]. According to a new report released by UNAIDS on the eve of World AIDS Day 2021, 1.5 million people were newly infected with HIV in 2020 [19]. The prevention and treatment of AIDS faces great challenges. From 2004 to 2017, 3310 cases of HIV were reported in Dalian, Liaoning Province, China, the average incidence rate is 4.01/100,000 population, the average mortality rate is 0.12/100,000 population, and the case fatality rate is 3.05%. The average incidence rate was lower than national level [20] and higher than the reported average incidence levels of some part of southern city in China [21,22].
The prediction of infectious diseases can detect the trend of disease development in time. Therefore, if an area has continuous HIV surveillance data, the incidence number of HIV can be predicted by establishing mathematical models, and the results could be used to the HIV monitoring and provide the scientific basis for prevention strategies of the area. It makes AIDS prevention and control work more targeted, predictable and initiative. In this study, two stage EMD-BPNN model was established to predict the incidence trend of HIV in Dalian. Firstly, the monthly original incidence data of HIV from 2004 to 2017 was decomposed into a number of component and residue by empirical modal decomposition, then the component and residue items were taken as data sets to establish the neural network model. Two-stage EMD-BPNN advantage lied in compared with the prediction model established by direct application of original data, the complexity of the model can be effectively reduced. Because the disease incidence data often had the characteristics of non-linear and non-stationary, but the components obtained through empirical modal decomposition belonged to the variable set with the same frequency, and the residue term belonged to the increasing or decreasing variable, so it was easy to fit.
The other advantage of EMD is that can visually expressed the overall trend of the sequence. As seen from res curve in Fig. 2 during 2004 to 2015 the monthly incidence of HIV is increasing and during 2016 to 2017the monthly incidence of HIV is declining. This may be with the intensification of monitoring and surveillance in the period from 2004 to 2015, the number of newly discovered past infection increased and the number of new HIV-infected persons continued to increase. But with the implementation of preventive and control measures such as national policy, behavioral intervention, health education, et al., the number of new HIV infections declined since 2016. Although the monthly incidence of HIV dropped slightly in 2016 and 2017, but it did not show a significant downward trend. According to the analysis of the past HIV data in Dalian, sexual transmission was the major transmission route and an increase of prevalence was noticed among MSM [23]. So in order to prevent the AIDS epidemic, it is suggested that the health administration department continue to strengthen the efforts and scope for the high risk behavior intervention among MSM.
In this study, to further assess the prediction performance of the hybrid EMD-BPNN model, the absolute percentage error was utilized to measure performance. According to the comparison of EMD-BPNN, BPNN, SARIMA model for monthly HIV data series and ARIMA model for yearly HIV incidence rate series, the SARIMA model performed better than BPNN, but worse than EMD-BPNN model, and the ARIMA model performed slightly worse than EMD-BPNN model. In BPNN model, Fig. 7 Autocorrelation function (ACF) for HIV incidence rate series with differenced once at the non-seasonal level